
BioData Mining

Springer Science and Business Media LLC

All preprints, ranked by how well they match BioData Mining's content profile, based on 15 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
OnSIDES (ON-label SIDE effectS resource) Database: Extracting Adverse Drug Events from Drug Labels using Natural Language Processing Models

Tanaka, Y.; Chen, H. Y.; Belloni, P.; Gisladottir, U.; Kefeli, J.; Patterson, J.; Srinivasan, A.; Zeitz, M.; Sirdeshmukh, G.; Berkowitz, J.; LaRow Brown, K.; Tatonetti, N. P.

2024-03-24 pharmacology and therapeutics 10.1101/2024.03.22.24304724 medRxiv
Top 0.1%
22.2%

Adverse drug events (ADEs) are the fourth leading cause of death in the US and cost billions of dollars annually in increased healthcare costs. However, few machine-readable databases of ADEs exist, limiting the opportunity to study drug safety on a broader, systematic scale. Recent advances in Natural Language Processing methods, such as BERT models, present an opportunity to accurately extract relevant information from unstructured biomedical text. As such, we fine-tuned a PubMedBERT model to extract ADE terms from descriptive text in FDA Structured Product Labels for prescription drugs. With this model, we achieve an F1 score of 0.90, AUROC of 0.92, and AUPR of 0.95 at extracting ADEs from the "Adverse Reactions" sections of the labels. We further utilize this method to extract serious ADEs from "Boxed Warnings" sections, and ADEs specifically noted for pediatric patients. Here, we present OnSIDES (ON-label SIDE effectS resource), a compiled, computable database of drug-ADE pairs generated with this method. OnSIDES contains more than 3.6 million drug-ADE pairs for 3,233 unique drug ingredient combinations extracted from 47,211 labels. Additionally, we expand this method to extract ADEs from drug labels of other major nations/regions - Japan, the UK, and the EU - to build a complementary OnSIDES-INTL database. To present potential applications, we used OnSIDES to predict novel drug targets and indications, analyze enrichment of ADEs across drug classes, and predict novel ADEs from chemical compound structures. We conclude that OnSIDES can be utilized as a comprehensive resource to study and enhance drug safety. One Sentence Summary: OnSIDES is a large, comprehensive database of adverse drug events extracted from drug labels using natural language processing methods.

2
Understanding science communication in human genetics using text mining

Morosoli, J. J.; Colodro-Conde, L.; Barlow, F. K.; Medland, S. E.

2020-07-25 scientific communication and education 10.1101/2020.07.24.219683 medRxiv
Top 0.1%
17.0%

We conducted the first systematic text mining review of online media coverage of genome-wide association studies (GWAS) and analyzed trends in media coverage, readability, themes, and mentions of ethical, legal, and social issues (ELSI). Over 5,000 online news articles published worldwide from 2005 to 2018 were included in the analyses. Our results show that while some GWAS attract a great deal of online interest, many are not reported on, and that those that are covered are described in language too complex to be understood by the general public. Ethical issues are largely unaddressed, while suggestions for translation are increasing over time. Our review identifies areas that need to improve to increase the effectiveness and accuracy of the communication of genetic research findings in online media. We have also developed a website where all results described here can be explored interactively: https://jjmorosoli.shinyapps.io/newas/.

3
Drug-drug interaction identification using large language models

Blotske, K.; Zhao, X.; Henry, K.; Gao, Y.; Tilley, A.; Cargile, M.; Murray, B.; Smith, S. E.; Barreto, E.; Bauer, S.; Sohn, S.; Liu, T.; Sikora, A.

2025-12-04 pharmacology and therapeutics 10.64898/2025.12.03.25341549 medRxiv
Top 0.1%
14.6%

Background: Drug-drug interactions (DDIs) are a significant source of morbidity and adverse drug events (ADEs), particularly in situations of polypharmacy and complex medication regimens. While rules-based software integrated in electronic health records (EHRs) has demonstrated proficiency in identifying DDIs present in medication regimens, large language model (LLM)-based identification requires thorough benchmarking and performance evaluation using high-quality datasets for safe use. The purpose of this study was to develop a series of benchmarking experiments for LLM performance in the identification and management of DDIs, using a purpose-built, clinician-annotated dataset of clinically relevant DDIs. Methods: We evaluated three LLMs (GPT-4o-mini, MedGemma-27B, LLaMA3-70B) using a clinician-annotated benchmark dataset of 750 DDI scenarios spanning three levels of diagnostic complexity. Tasks were aligned with flexible judgment formats: (1) a pointwise two-drug classification task, (2) a pairwise three-drug discrimination task, and (3) a listwise 4-6 drug selection task. Standardized zero-shot prompting with task-specific instructions was applied for all models. Performance was assessed using precision, recall, F1 score, and accuracy. Reliability was quantified using self-consistency across repeated runs and confidence-aligned metrics to capture stability in model reasoning. Results: Across the three experiments, model performance varied by task structure and interaction severity. LLaMA3-70B demonstrated the highest recall and F1 score in the pointwise task, whereas GPT-4o-mini achieved superior accuracy and consistency in the pairwise and listwise tasks. MedGemma-27B showed competitive performance in identifying Category D interactions. Self-consistency decreased as task complexity increased, highlighting reduced stability in multi-drug reasoning. No model exhibited uniformly high reliability across all judgment formats.
Conclusions: Current LLMs show promising but uneven capabilities in identifying DDIs across clinically relevant task structures. Performance degrades as the reasoning space expands, and stability across repeated queries remains limited. These findings emphasize the need for multi-format evaluation frameworks and reliability-aware assessment when considering LLMs for medication-safety applications.

4
Assessing the potential of ChatGPT-4 to accurately identify drug-drug interactions and provide clinical pharmacotherapy recommendations

Most, A.; Chase, A.; Sikora, A.

2024-06-30 pharmacology and therapeutics 10.1101/2024.06.29.24309701 medRxiv
Top 0.1%
14.5%

Background: Large language models (LLMs) such as ChatGPT have emerged as promising artificial intelligence tools to support clinical decision making. The ability of ChatGPT to evaluate medication regimens, identify drug-drug interactions (DDIs), and provide clinical recommendations is unknown. The purpose of this study is to examine the performance of GPT-4 in identifying clinically relevant DDIs and to assess the accuracy of the recommendations provided. Methods: A total of 15 medication regimens were created containing commonly encountered DDIs that were considered either clinically significant or clinically unimportant. Two separate prompts were developed for medication regimen evaluation. The primary outcome was whether GPT-4 identified the most relevant DDI within the medication regimen. Secondary outcomes included rating GPT-4's interaction rationale, clinical relevance ranking, and overall clinical recommendations. Interrater reliability was determined using the kappa statistic. Results: GPT-4 identified the intended DDI in 90% of medication regimens provided (27/30). GPT-4 categorized 86% as highly clinically relevant, compared to 53% categorized as highly clinically relevant by expert opinion. Inappropriate clinical recommendations potentially causing patient harm were provided in 14% of GPT-4's responses (2/14), and 63% of responses contained accurate information but incomplete recommendations (19/30). Conclusions: While GPT-4 demonstrated promise in its ability to identify clinically relevant DDIs, application to clinical cases remains an area of investigation. Findings from this study may assist in future development and refinement of LLMs for drug-drug interaction queries to assist in clinical decision-making.

5
Hypothesis Generation For Rare and Undiagnosed Diseases Through Clustering and Classifying Time-Versioned Biological Ontologies

Bradshaw, M. S.; Gibbs, C.; Martin, S.; Firman, T.; Gaskell, A.; Fosdick, B.; Layer, R. M.

2023-11-13 bioinformatics 10.1101/2023.11.09.566432 medRxiv
Top 0.1%
14.4%

Rare diseases affect 1-in-10 people in the United States and, despite increased genetic testing, up to half never receive a diagnosis. Even when using advanced genome sequencing platforms to discover variants, if there is no connection between the variants found in the patient's genome and their phenotypes in the literature, then the patient will remain undiagnosed. When a direct variant-phenotype connection is not known, putting a patient's information in the larger context of phenotype relationships and protein-protein interactions may provide an opportunity to find an indirect explanation. Databases such as STRING contain millions of protein-protein interactions, and HPO contains the relations of thousands of phenotypes. By integrating these networks and clustering the entities within, we can potentially discover latent gene-to-phenotype connections. The historical records for STRING and HPO provide a unique opportunity to create a network time series for evaluating cluster significance. Most excitingly, working with Children's Hospital Colorado, we provide promising hypotheses about latent gene-to-phenotype connections for 38 patients with undiagnosed diseases. We also provide potential answers for 14 patients listed on MyGene2. Clusters our tool finds significant harbor 2.35 to 8.72 times as many gene-to-phenotype edges inferred from known drug interactions as clusters found to be insignificant. Our tool, BOCC, is available as a web app and command line tool.

6
An automatic diagnostic system for pediatric genetic disorders developed by linking genotype and phenotype information

Dong, X.; Wu, B.; Wang, H.; Yang, L.; Chen, X.; Ni, Q.; Wang, Y.; Liu, B.; Lu, Y.; Zhou, W.

2021-08-28 pediatrics 10.1101/2021.08.26.21261185 medRxiv
Top 0.1%
14.3%

Background: Quantitatively describing the phenotype spectrum of pediatric disorders has remarkable power to assist genetic diagnosis. Here, we developed a matrix that provides this quantitative description of genomic-phenotypic associations and constructed an automatic system to assist the diagnosis of pediatric genetic disorders. Results: 20,580 patients with genetic diagnostic conclusions from the Children's Hospital of Fudan University during 2015 to 2019 were reviewed. Based on that, a phenotype spectrum matrix -- cGPS (clinical Genes Preferential Synopsis) -- was designed using a Naive Bayes model to quantitatively describe genes' contributions to clinical phenotype categories. Further, for patients who have both genomic and phenotype data, we designed a ConsistencyScore based on cGPS. The ConsistencyScore aimed to identify genes that were more likely to be the genetic cause of the patient's phenotype and to prioritize the causal gene among all candidates. When using the ConsistencyScore in each sample to predict the causal gene for patients, the AUC could reach 0.975 for the ROC curve (95% CI 0.972-0.976) and 0.575 for the precision-recall curve (95% CI 0.541-0.604). Further, the performance of the ConsistencyScore was evaluated on another cohort of 2,323 patients, where it ranked the causal gene first for 75.00% (95% CI 70.95%-79.07%) of the 296 positively genetically diagnosed patients. The causal gene could be ranked within the top 10 by the ConsistencyScore for 97.64% (95% CI 95.95%-99.32%) of patients, which is much higher than existing algorithms (p < 0.001). Conclusions: cGPS and the ConsistencyScore offer useful tools to prioritize disease-causing genes for pediatric disorders and show great potential in clinical applications.

7
Genetic and demographic predictors of general reading ability in two cohorts

Lancaster, H. S.; Dinu, V.; Li, J.; Gruen, J. R.; GRaD Consortium

2021-08-29 pediatrics 10.1101/2021.08.24.21262573 medRxiv
Top 0.1%
13.4%

Purpose: Reading ability is a complex skill that draws on multiple proficiencies and develops through interactions between genetic and environmental factors. This study presents an alternative analytic pipeline to identify key genetic and demographic contributors to reading ability. Methods: We analyzed data from the Avon Longitudinal Study of Parents and Children (ALSPAC; N = 3,232) using a multi-step analytical pipeline. To reduce measurement error, we generated a latent reading ability score. We selected single nucleotide polymorphisms (SNPs) based on existing literature and genome-wide association studies (GWAS). We applied elastic net regression to identify informative predictors in two models: a SNP-only model and a SNP-plus model that added demographic, environmental, and behavioral variables. We compared the SNP-based heritability estimates and R2 from the fitted models. We also performed pathway enrichment analysis on the informative SNPs. Results: The traditional GWAS identified one genome-wide significant SNP on chromosome X and produced a moderate heritability estimate of .23 (SE = 0.07). We included 148 SNPs in the elastic net models. The SNP-only model identified 61 informative SNPs (R2 = .12), whereas the SNP-plus model identified 96 informative SNPs (R2 = .32). The SNP-plus model also showed that several behavioral characteristics positively predicted latent reading ability. Enrichment analysis revealed overrepresentation of several biological pathways among the informative SNPs. Conclusions: This study shows that our analytic pipeline can identify important genetic and demographic predictors of reading ability, providing a powerful alternative to traditional methods and contributing to a deeper understanding of the factors that drive reading development.

8
A Systematic Process for Assessing Fitness-for-Purpose of Health Outcomes for Computable Phenotyping with Electronic Health Record Data

Gatto, N. M.; Cronkite, D. J.; Wartko, P. D.; Ball, R.; Carrell, D. S.; Eniafe, R.; Desai, R. M.; Floyd, J. S.; Lee, T.; Nelson, J. C.; Shebl, F. M.; Schoeplein, R.; Toh, S.; Zhang, M.; Dublin, S.; Hernandez-Munoz, J. J.

2025-09-04 pharmacology and therapeutics 10.1101/2025.08.29.25334394 medRxiv
Top 0.1%
12.4%

Purpose: Information from electronic health records (EHRs) may be incorporated into computable phenotype algorithms in efforts to overcome inaccuracies of algorithms based on administrative claims data alone. However, such efforts can be resource-intensive and unsuccessful. Assessing the feasibility of computable phenotyping for a health outcome of interest (HOI) before proceeding is therefore recommended. Methods: We developed a systematic fitness-for-purpose (FFP) assessment process to implement concepts outlined in a previously described general framework for computable phenotyping incorporating EHR data. Our process includes verifying the HOI is well-defined, reviewing clinical information about the HOI, identifying existing algorithms and their performance, evaluating HOI clinical and data complexity, and determining an overall FFP conclusion and recommendation. We applied this process to ten HOIs lacking high-performing claims-based algorithms, selecting HOIs of public health importance that varied in clinical and data complexity, including neutropenia, pericardial effusion, and drug-induced liver injury. Results: HOIs assessed as having moderate (vs. easy) overall difficulty had characteristics such as the need for natural language processing, integration of multiple laboratory test results, or longitudinal EHR data. HOIs assessed as having high difficulty required using data from multiple EHR sources, ruling out many other potential causes, or relying on low-sensitivity diagnostic tests. Input from experts in EHR data and clinical care was crucial. Conclusion: EHR data have the potential to enhance the accuracy of defining certain HOIs for research and surveillance compared to administrative claims data. The process and tools we created will support others in assessing the FFP of HOIs for computable phenotyping.
Five key points:
- Incorporating electronic health record (EHR) data into computable phenotypes could improve accurate identification of health outcomes of interest (HOIs), but such work can be resource intensive.
- We developed a systematic fitness-for-purpose (FFP) process and tools to assess the feasibility of computable phenotyping for HOIs.
- Steps include identifying existing algorithms and their performance, ensuring the HOI is well-defined, evaluating clinical and data complexity, and determining a feasibility recommendation.
- Difficulty increased with a need for natural language processing, multiple laboratory tests, longitudinal EHR data, multiple EHR sources, or ruling out other potential causes.
- Input from EHR data and clinical care experts was crucial to the FFP assessment process.
Plain Language Summary (PLS): Attempts to identify diseases and health conditions by applying computer programs to information easily gleaned from insurance claims of tens of thousands of patients (such as the FDA's ongoing safety monitoring of approved drugs or medical products) are often unsuccessful because the data lack nuance. Incorporating information from electronic health records (EHRs) and patient chart notes may improve accurate identification of health outcomes. Because this can be resource-intensive, we designed a process and tools to assess the feasibility of including EHR data in computer algorithms to identify health outcomes. Steps included identifying existing algorithms and their performance, building familiarity with the outcome and making sure it is well-defined, evaluating clinical and data complexity, and determining a conclusion about feasibility. We applied our process to ten health outcomes of public health importance. Health outcomes were considered moderately difficult for computerized algorithms if they required natural language processing, integration of multiple laboratory tests, or EHR data from multiple timepoints.
Health outcomes having high difficulty required using multiple EHR data types, ruling out many alternative causes of the HOI (other than medications), or relying on diagnostic tests of low accuracy. Input from EHR data and clinical care experts was crucial for the assessment process.

9
Comparison of missing data handling methods for variant pathogenicity predictors

Särkkä, M. I.; Myöhänen, S.; Marinov, K.; Saarinen, I.; Lahti, L.; Fortino, V.; Paananen, J.

2022-06-18 bioinformatics 10.1101/2022.06.17.496578 medRxiv
Top 0.1%
12.3%

Background: Modern clinical genetic tests utilize next-generation sequencing (NGS) approaches to comprehensively analyze genetic variants from patients. Out of these millions of variants, clinically relevant variants that match the patient's phenotype need to be identified accurately within a rapid timeframe that facilitates clinical action. As manual evaluation of variants is not a feasible option for meeting the speed and volume requirements of clinical genetic testing, automated solutions are needed. Various machine learning (ML), artificial intelligence (AI), and in silico variant pathogenicity predictors have been developed to solve this challenge. These solutions rely on the comprehensiveness of the available data and struggle with the sparse nature of genetic variant data. Therefore, careful treatment of missing data is necessary, and the selected methods may have a huge impact on accuracy, reliability, speed, and the associated computational costs. Results: We present an open-source framework called AMISS that can be used to evaluate the performance of different methods for handling missing genetic variant data in the context of variant pathogenicity prediction. Using AMISS, we evaluated 14 methods for handling missing values. The performance of these methods varied substantially in terms of precision, computational costs, and other attributes. Overall, simpler imputation methods, and specifically mean imputation, performed best. Conclusions: Selection of the missing data handling method is crucial for AI/ML-based classification of genetic variants. We show that utilizing sophisticated imputation methods is not worth the cost when used in the context of genetic variant pathogenicity classification.
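The winning method in this comparison, mean imputation, is also the simplest to state: replace each missing value with the column mean over the observed entries. A minimal sketch (the feature matrix below is hypothetical and illustrative only; it is not the AMISS framework itself):

```python
def mean_impute(rows):
    """Replace missing values (None) in each column with that column's
    mean over the observed entries. rows: list of equal-length lists
    (e.g. per-variant annotation scores)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column is entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Hypothetical annotation scores for three variants, with gaps:
variants = [[1.0, None], [0.5, 0.25], [None, 0.75]]
print(mean_impute(variants))  # [[1.0, 0.5], [0.5, 0.25], [0.75, 0.75]]
```

Despite ignoring correlations between features entirely, this baseline is what the abstract reports as outperforming more sophisticated imputation for variant pathogenicity classification.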

10
Improved Automatic Pharmacovigilance: An Enhancement to the MedWatcher Social System for Monitoring Adverse Events

Nguyen, A. T.; Lien, J.; Raff, E.; Mekaru, S. R.

2019-07-31 bioinformatics 10.1101/717421 medRxiv
Top 0.1%
10.6%

Traditional pharmacovigilance systems rely on adverse event reports received by regulatory authorities such as the United States Food and Drug Administration (FDA). These traditional systems suffer from underreporting and are not timely due to their reliance on third-party sentinels. To address these issues, the MedWatcher Social system for monitoring adverse events through automated processing of digital social media data and crowdsourcing was launched in 2012 by Boston Children's Hospital and the FDA. The system is rooted in the well-established FDA MedWatch system.

MedWatcher Social uses an indicator score approach to identify adverse events. This study evaluates the MedWatcher Social adverse event classifier's performance on Twitter data and proposes an enhancement to the indicator score method that results in improved adverse event identification.

Our research suggests that automatic pharmacovigilance systems using the original indicator score approach should be updated. Careful consideration of modeling assumptions is critical when designing algorithms for computational epidemiology, and algorithms should be regularly reevaluated to identify enhancements and to remedy concept drift.

11
Transfer learning improves outcome predictions for ASD from gene expression in blood

Robasky, K.; Kim, R.; Yi, H.; Xu, H.; Bao, B.; Chang, A. W. T.; Courchesne, E.; Lewis, N. E.

2021-06-29 bioinformatics 10.1101/2021.06.26.449864 medRxiv
Top 0.1%
10.2%

Background: Predicting outcomes in human genetic studies is difficult because the number of variables (genes) is often much larger than the number of observations (human subject tissue samples). We investigated means for improving model performance on the types of under-constrained problems that are typical in human genetics, where the number of strongly correlated genes (features) may exceed 10,000, and the number of study participants (observations) may be limited to under 1,000. Methods: We created train, validate, and test datasets from 240 microarray observations from 127 subjects diagnosed with autism spectrum disorder (ASD) and 113 typically developing (TD) subjects. We trained a neural network model (a.k.a. the naive model) on 10,422 genes using the train dataset, composed of 70 ASD and 65 TD subjects, and we restricted the model to one fully-connected hidden layer to minimize the number of trainable parameters, including a dropout layer to help prevent overfitting. We experimented with alternative network architectures and tuned the hyperparameters using the validate dataset, and performed a single, final evaluation using the holdout test dataset. Next, we trained a neural network model using the identical architecture and identical genes to predict tissue type in GTEx data. We transferred that learning by replacing the top layer of the GTEx model with a layer to predict ASD outcome, and we retrained the new layer on the ASD dataset, again using the identical 10,422 genes. Findings: The naive neural network model had AUROC=0.58 for the task of predicting ASD outcomes, which saw a statistically significant 7.8% improvement from transfer learning. Interpretation: We demonstrated that neural network learning could be transferred from models trained on large RNA-Seq gene expression data to a model trained on a small microarray gene expression dataset, with clinical utility for mitigating over-training on small sample sizes.
Incidentally, we built a highly accurate classifier of tissue type with which to perform the transfer learning. Funding: This work was supported in part by NIMH R01-MH110558 (E.C., N.E.L.). Author Summary: Image recognition and natural language processing have enjoyed great success in reusing computational efforts and data sources to overcome the problem of over-training a neural network on a limited dataset. Other domains using deep learning, including genomics and clinical applications, have been slower to benefit from transfer learning. Here we demonstrate data preparation and modeling techniques that allow genomics researchers to take advantage of transfer learning in order to increase the utility of limited clinical datasets. We show that a naive, non-pre-trained model's performance can be improved by 7.8% by transferring learning from a highly performant model trained on GTEx data to solve a similar problem.
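The transfer step described here, keeping the pre-trained layers and retraining only a new top layer, works because it collapses the number of trainable parameters. A back-of-the-envelope count under assumed sizes (the 10,422 input genes come from the abstract; the 64-unit hidden layer and single-unit output head are hypothetical, since the paper does not state layer widths here):

```python
def dense_params(n_in, n_out):
    """Parameter count for one fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Assumed architecture for illustration: 10,422 input genes -> 64 hidden
# units -> 1 output unit.
full_model = dense_params(10422, 64) + dense_params(64, 1)  # train everything
head_only = dense_params(64, 1)                             # transfer: retrain only the new top layer
print(full_model, head_only)
```

With these assumed sizes the head has roughly four orders of magnitude fewer parameters than the full network, which is why retraining only the head on 135 training subjects is far less prone to over-fitting than training from scratch.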

12
Comparing the XGBoost machine learning algorithm to polygenic scoring for the prediction of intelligence based on genotype data

Fahey, L.; Morris, D. W.; O Broin, P.

2022-06-15 bioinformatics 10.1101/2022.06.12.495467 medRxiv
Top 0.1%
10.2%

A polygenic score (PGS) is a linear combination of effects from a GWAS that represents, and can be used to predict, genetic predisposition to a particular phenotype. A key limitation of the PGS method is that it assumes additive and independent SNP effects, when it is known that epistasis (gene interactions) can contribute to complex traits. Machine learning methods can potentially overcome this limitation by virtue of their ability to capture nonlinear interactions in high-dimensional data. Intelligence is a complex trait for which PGS prediction currently explains up to 5.2% of the variance, a relatively small proportion of the heritability estimate of 50% obtained from twin studies. Here, we use gradient boosting, a machine learning technique based on an ensemble of weak prediction models, to predict intelligence from genotype data. We found that while gradient boosting did not outperform the PGS method in predicting intelligence based on SNP data, it was capable of achieving similar predictive performance with fewer than a quarter of the SNPs, and the top SNPs identified as important for predictive performance were biologically meaningful. These results indicate that ML methods may be useful in interpreting the biological meaning underpinning SNP-phenotype associations, due to the smaller number of SNPs required in the ML model as opposed to the standard PGS method based on GWAS.
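The additive assumption this abstract critiques is easy to see in code: a PGS is just a weighted sum of effect-allele dosages, with no interaction terms. A minimal sketch (SNP identifiers, dosages, and effect sizes below are hypothetical, for illustration only):

```python
def polygenic_score(dosages, effect_sizes):
    """Additive PGS: sum over SNPs of (effect-allele dosage x GWAS effect size).

    dosages: dict mapping SNP id -> count of effect alleles (0, 1, or 2)
    effect_sizes: dict mapping SNP id -> per-allele effect estimate from a GWAS
    SNPs missing from either dict are ignored, mirroring the usual
    intersection of a score file with the genotyped variants.
    """
    shared = dosages.keys() & effect_sizes.keys()
    return sum(dosages[s] * effect_sizes[s] for s in shared)

# Hypothetical example: three SNPs, one carried on both chromosomes.
dosages = {"rs1": 2, "rs2": 0, "rs3": 1}
effects = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}
print(polygenic_score(dosages, effects))  # 2*0.10 + 0*(-0.05) + 1*0.20 = 0.4
```

Because every SNP enters the sum independently, any epistatic (interaction) effect is invisible to the score, which is the gap the gradient boosting approach in this paper probes.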

13
Can machine learning aid in identifying disease genes? The case of autism spectrum disorder

Gunning, M.; Pavlidis, P.

2020-11-27 bioinformatics 10.1101/2020.11.26.394676 medRxiv
Top 0.1%
10.2%

Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: Can machine learning aid in the discovery of disease genes? We collected thirteen published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like the number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not outperform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using heterogeneous biological data of unknown reliability.

14
Automated Evaluation of Antibiotic Prescribing Guideline Concordance in Pediatric Sinusitis Clinical Notes

Weissenbacher, D.; Dutcher, L.; Boustany, M.; Cressman, L.; O'Connor, K.; Hamilton, K. W.; Gerber, J.; Grundmeier, R.; Gonzalez-Hernandez, G.

2024-08-09 pediatrics 10.1101/2024.08.09.24311714 medRxiv
Top 0.1%
10.2%

Background: Ensuring antibiotics are prescribed only when necessary is crucial for maintaining their effectiveness and is a key focus of public health initiatives worldwide. In cases of sinusitis, among the most common reasons for antibiotic prescriptions in children, health-care providers must distinguish between bacterial and viral causes based on clinical signs and symptoms. However, due to the overlap between symptoms of acute sinusitis and viral upper respiratory infections, antibiotics are often over-prescribed. Objectives: Currently, there are no electronic health record (EHR)-based methods, such as lab tests or ICD-10 codes, to retroactively assess the appropriateness of these prescriptions, making manual chart review the only available method for evaluation, which is time-intensive and not feasible at a large scale. In this study, we propose using natural language processing to automate this assessment. Methods: We developed, trained, and evaluated generative models to classify the appropriateness of antibiotic prescriptions in 300 clinical notes from pediatric patients with sinusitis seen at a primary care practice in the Children's Hospital of Philadelphia network. We utilized standard prompt engineering techniques, including few-shot learning and chain-of-thought prompting, to refine an initial prompt. Additionally, we employed Parameter-Efficient Fine-Tuning to train a medium-sized generative model, Llama 3 70B-instruct. Results: While parameter-efficient fine-tuning did not enhance performance, the combination of few-shot learning and chain-of-thought prompting proved beneficial. Our best results were achieved using the largest generative model publicly available to date, Llama 3.1 405B-instruct. On our test set, the model correctly identified 91.4% of the 35 notes where antibiotic prescription was appropriate and 71.4% of the 14 notes where it was not appropriate.
However, notes that were insufficiently, vaguely, or ambiguously documented by physicians posed a challenge to our model, as none in the evaluation sets were accurately classified. Conclusion: Our generative model demonstrated strong performance in the challenging task of chart review. This level of performance may be sufficient for deploying the model within the EHR, where it can assist physicians in real time to prescribe antibiotics in concordance with the guidelines, or for monitoring antibiotic stewardship on a large scale.

15
acorn: an R package for de novo variant analysis

Turner, T. N.

2023-04-12 bioinformatics 10.1101/2023.04.11.536422 medRxiv
Top 0.1%
10.0%

Background: The study of de novo variation is important for assessing biological characteristics of new variation and for studies related to human phenotypes. Software programs exist to call de novo variants, and programs also exist to test the burden of these variants in genomic regions; however, I am unaware of a program that fits in between these two aspects of de novo variant assessment. This intermediate space is important for assessing the quality of de novo variants and for understanding the characteristics of the callsets. For this reason, I developed the R package acorn. Results: acorn is an R package that examines various features of de novo variants, including subsetting the data by individual(s), variant type, or genomic region; calculating features including variant change counts, variant lengths, and presence/absence at CpG sites; and characteristics of parental age in relation to de novo variant counts. Conclusions: acorn is an R package that fills a critical gap in assessing de novo variants and will be of benefit to many investigators studying de novo variation.

16
simmr: An open-source tool to perform simulations in Mendelian Randomization

Lorincz-Comi, N. J.; Yang, Y.; Zhu, X.

2023-09-15 bioinformatics 10.1101/2023.09.11.556975 medRxiv
Top 0.1%
9.9%

Mendelian Randomization (MR) has become a popular tool for inferring causality of risk factors on disease. There are currently over 45 different methods available to perform MR, reflecting this extremely active research area. It would be desirable to have a standard simulation environment to objectively evaluate existing and future methods. We present simmr, an open-source software package for performing simulations that evaluate the performance of MR methods in a range of scenarios encountered in practice. Researchers can directly modify the simmr source code, so that the research community may arrive at a widely accepted framework for evaluating the performance of different MR methods.
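A toy version of the kind of scenario an MR simulation environment generates: one genetic instrument, an exposure, an outcome, and a shared confounder, with the causal effect recovered by the Wald ratio. This is a minimal sketch for illustration, not simmr's actual simulation design; all parameter values are assumptions.

```python
import random

def simulate_mr(n=20000, beta=0.3, seed=1):
    """Simulate instrument G, exposure X, outcome Y with confounder U,
    then estimate the causal effect of X on Y via the Wald ratio
    cov(G, Y) / cov(G, X). G affects Y only through X (valid instrument)."""
    rng = random.Random(seed)
    g = [rng.choice([0, 1, 2]) for _ in range(n)]   # genotype dosage
    u = [rng.gauss(0, 1) for _ in range(n)]         # unobserved confounder
    x = [0.5 * gi + ui + rng.gauss(0, 1) for gi, ui in zip(g, u)]
    y = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

    def cov(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

    return cov(g, y) / cov(g, x)                    # Wald ratio estimate
```

Because U confounds X and Y, a naive regression of Y on X would be biased, while the instrument-based ratio recovers beta; simulation frameworks vary parameters like instrument strength and pleiotropy to stress-test the 45+ MR methods.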

17
Zero-Shot Evaluation of Kimi K2 on Pediatric Clinical Cases

Mondillo, G.; Masino, M.; Colosimo, S.; Perrotta, A.; Frattolillo, V.; Abbate, F. G.

2025-07-29 pediatrics 10.1101/2025.07.29.25332368 medRxiv
Top 0.1%
9.4%

Background: The application of large language models (LLMs) in pediatric medicine requires rigorous performance evaluation prior to clinical implementation. Objective: To evaluate the accuracy of the Kimi K2 model in analyzing pediatric clinical cases using a zero-shot approach. Methods: 2,249 multiple-choice questions from pediatric clinical cases, covering ages from 1 day to 16 years and extracted from the MedQA dataset, were analyzed. The model was tested via API with standardized parameters, temperature set to zero, and zero-shot prompts. Accuracy was calculated by comparing the responses with the dataset's ground truth. Results: Kimi K2 achieved an overall accuracy of 78.39%, corresponding to 1,763 correct answers out of 2,249, with 100% of responses in the required format. Conclusions: The model demonstrates competitive performance for medical education and diagnostic support, while still having limitations that require human clinical supervision.
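The reported overall accuracy follows directly from the raw counts given in the abstract:

```python
# Reproducing the reported accuracy from the stated counts.
correct, total = 1763, 2249
accuracy = 100 * correct / total
print(f"{accuracy:.2f}%")  # 78.39%
```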

18
Closing the Pediatric Divide: A Performance Analysis of the GPT-5 Family in Medical Diagnostics

Mondillo, G.; Abbate, F. G.; Masino, M.; Colosimo, S.; Perrotta, A.; Frattolillo, V.

2025-08-29 pediatrics 10.1101/2025.08.28.25334657 medRxiv
Top 0.1%
9.0%

Background: Large Language Models (LLMs) have demonstrated significant potential in clinical medicine, but a persistent performance gap exists in the pediatric domain due to its unique complexities. This study provides the first comparative evaluation of the new GPT-5 family (Nano, Mini, and full) to assess the impact of model scale on diagnostic accuracy and on this specific adult-pediatric disparity. Methods: A benchmarking study was conducted using 2,000 multiple-choice questions from the MedQA dataset, equally divided between adult (n=1,000) and pediatric (n=1,000) domains. GPT-5, GPT-5 Mini, and GPT-5 Nano were tested via API with standardized parameters (temperature=0, reasoning effort=minimal, verbosity=low, maxtoken=170). Accuracy was calculated and statistically compared across domains for each model. Results: A clear dose-response relationship was observed between model size and accuracy. GPT-5 Nano exhibited a significant performance gap, with an accuracy of 71.0% in adult medicine versus 55.4% in pediatrics (a 15.6 percentage point difference, p<0.001). GPT-5 Mini substantially narrowed this gap to 5.7 points (81.5% vs. 75.8%, p=0.001). Critically, the full GPT-5 model eliminated the disparity, achieving comparable accuracy in adult medicine (86.3%) and slightly higher accuracy in pediatrics (88.5%) (p=0.138). Performance gains from scaling up were disproportionately larger in the pediatric domain. Conclusion: The GPT-5 family marks a substantial advancement in medical AI. The full-size model not only achieves high diagnostic accuracy but, crucially, overcomes the previously documented performance limitations in pediatrics. This demonstrates that sufficient model scale is vital for mastering the nuances of specialized clinical domains. These findings support a tiered implementation strategy based on task criticality and underscore the need for continued validation in real-world clinical settings.
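The adult-vs.-pediatric comparisons above can be checked with a standard two-proportion z-test, taking correct counts inferred from the reported percentages on n=1,000 per domain. The paper's exact statistical procedure may differ; this is a sketch of one common choice.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided two-proportion z-test with a pooled standard error;
    returns (z statistic, p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p

# GPT-5 Nano: 71.0% adult (710/1000) vs 55.4% pediatric (554/1000)
z, p = two_proportion_z(710, 1000, 554, 1000)
```

With these counts the Nano gap gives p well below 0.001 and the Mini gap (815/1000 vs. 758/1000) gives p on the order of 10^-3, consistent with the reported values.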

19
Predicting Mental and Psychomotor Delay in Very Pre-term Infants using Large Language Models

Huang, Z.; Flory, M. J.; Kittler, P. M.; Phan, H. T.; Demirci, G. M.; Gordon, A. D.; Parab, S. M.; Tsai, C.-L.

2025-08-02 pediatrics 10.1101/2025.07.31.25332524 medRxiv
Top 0.1%
8.9%

Very preterm infants face a considerably higher risk of neurodevelopmental delays, making early diagnosis and timely intervention crucial for improving long-term outcomes. In this study, we used large language models (LLMs) to predict mental and psychomotor delays at 25 months using maternal and perinatal records combined with longitudinal features up to 22 months of age. The LLMs were employed to generate natural language descriptions of each infant, which were then used as input to a language-model-based classifier. Compared with a random-forest model using only the numerical tabular data, our model achieved a 4.2% increase in AUROC for mental delay prediction and a 3.2% increase for psychomotor delay prediction three months before the 25-month assessment. These findings highlight the potential of LLMs as powerful tools for assessing the risk of neurodevelopmental delays in preterm infants.

20
Quantitative Genetic Scoring, or how to put a number on an arbitrary genetic region

Schoenmacker, G. H.; Vlaming, P.; Pallesen, J.; Pikulina, M. Y.; Ghamarian, A. H.; Demontis, D. H.; Borglum, A.; Galesloot, T. E.; Poelmans, G.; Franke, B.; Claassen, T.; Heskes, T.; Buitelaar, J.; Arias Vasquez, A.

2021-01-01 bioinformatics 10.1101/2020.12.15.422886 medRxiv
Top 0.1%
8.8%

Motivation: With the increasing availability of genome-wide genetic data, methods are required to combine genetic variables with other sources of data in statistical models. This paper introduces quantitative genetic scoring (QGS), a dimensionality reduction method that creates quantitative genetic variables representing arbitrary genetic regions. Methods: QGS is defined as the sum of absolute differences in the genetic sequence between a subject and a reference population. QGS properties such as its distribution and sensitivity to region size were examined, and QGS was tested in six existing genomic data sets of various sizes and phenotypes. Results: QGS can reduce genetic information by >98% yet explain phenotypic variance at low, medium, and high levels of granularity. Associations based on QGS are independent of both the size and the linkage disequilibrium structure of the underlying region. In combination with stability selection, QGS finds significant results where traditional genome-wide association approaches struggle. In conclusion, QGS preserves phenotypically significant genetic variance while reducing dimensionality, allowing researchers to include quantitative genetic information in any type of statistical analysis. Availability: https://github.com/machine2learn/QGS Contact: gido.schoenmacker@radboudumc.nl Supplemental information: Supplemental data are available online.
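The QGS definition in the abstract (a per-region sum of absolute differences between a subject and a reference population) can be sketched as follows. The exact encoding and normalization used in the paper may differ; here the reference is represented by per-variant allele frequencies, an assumption for illustration.

```python
def qgs(subject_dosages, reference_freqs):
    """Collapse a genetic region to one number: the sum of absolute
    differences between the subject's allele dosages (0/1/2 per variant)
    and the expected reference dosage (2 * allele frequency)."""
    assert len(subject_dosages) == len(reference_freqs)
    return sum(abs(d - 2 * f)
               for d, f in zip(subject_dosages, reference_freqs))
```

However many variants the region contains, the output is a single quantitative variable, which is what lets QGS-based region scores slot into ordinary statistical models alongside non-genetic covariates.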